Dragos Ailoae

2024-10-30

Regression with Binary Dependent Variables

Application: Difference-in-Differences Regression

The two 'differences' in the difference-in-differences (DiD) estimator are: (i) the difference between the treatment-group and control-group means of the response after the treatment, and (ii) the same difference before the treatment. The estimator is the change in that gap from before to after, that is, (i) minus (ii). Let us denote the four averages of the response as follows:

  • $\bar{y}_{T, A}$ Treatment, After
  • $\bar{y}_{C, A}$ Control, After
  • $\bar{y}_{T, B}$ Treatment, Before
  • $\bar{y}_{C, B}$ Control, Before

The difference-in-differences estimator $\hat{\delta}$ is defined as: $$ \hat{\delta}=\left(\bar{y}_{T, A}-\bar{y}_{C, A}\right)-\left(\bar{y}_{T, B}-\bar{y}_{C, B}\right) $$
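
Regrouping the same four means gives an equivalent form that some readers find easier to interpret: the change over time in the treatment group minus the change over time in the control group. This is only an algebraic rearrangement of the definition, not a different estimator:

$$ \hat{\delta}=\left(\bar{y}_{T, A}-\bar{y}_{T, B}\right)-\left(\bar{y}_{C, A}-\bar{y}_{C, B}\right) $$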

John Snow’s Cholera Hypothesis


Cholera Incidence per 10,000 (Snow 1855)
$$ \begin{array}{lcc} \textbf{Water Company Name} & \textbf{1849} & \textbf{1854} \\ \hline \text{Southwark \& Vauxhall} & 135 & 147 \\ \text{Lambeth} & 85 & 19 \end{array} $$

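Taking Lambeth as the treatment group and Southwark & Vauxhall as the control group, the equation above gives $\hat{\delta}= (19 - 147) - (85 - 135) = -78$, i.e. cholera incidence among Lambeth customers fell by 78 cases per 10,000 people relative to the change among Southwark & Vauxhall customers.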
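
As a quick check of the arithmetic, the four incidence figures from the table can be plugged into the estimator in R. This is a minimal sketch: the object names (`snow`, `did_snow`) are made up for illustration, and Lambeth is treated as the treatment group, matching the order of terms in the calculation above.

```r
# Cholera incidence per 10,000, copied from the table (Snow 1855)
snow <- matrix(
  c(135, 147,   # Southwark & Vauxhall: 1849, 1854
     85,  19),  # Lambeth:              1849, 1854
  nrow = 2, byrow = TRUE,
  dimnames = list(company = c("Southwark & Vauxhall", "Lambeth"),
                  year    = c("1849", "1854"))
)

# (treatment - control) after, minus (treatment - control) before
did_snow <- (snow["Lambeth", "1854"] - snow["Southwark & Vauxhall", "1854"]) -
            (snow["Lambeth", "1849"] - snow["Southwark & Vauxhall", "1849"])
did_snow
```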

Two-way fixed effects difference-in-difference estimator

Instead of manually calculating the four means and their difference-in-differences, it is possible to obtain the difference-in-differences estimate and its statistical properties (such as its standard error) by running a regression that includes indicator variables for treatment and after, together with their interaction term. The advantage of a regression over simply using the equation above is that it allows us to control for other factors that might otherwise be confounded with the treatment effect. The simplest difference-in-differences regression model is presented below, where the response $Y$ stands for $y_{i t}$, the response for unit $i$ in period $t$. In the typical difference-in-differences model there are only two periods, before and after.

The regression is:

$$ Y=\alpha_{g}+\alpha_{t}+\beta_{1} \text { Treated }+\varepsilon $$

where $\alpha_{g}$ is a set of fixed effects for the group that you're in (in the simplest form, just "Treated" or "Untreated") and $\alpha_{t}$ is a set of fixed effects for the time period you're in (in the simplest form, just "before treatment" and "after treatment"). Treated, then, is a binary variable indicating that you are being treated right now; in other words, you're in a treated group in the after-treatment period. The coefficient on Treated, $\beta_{1}$, is your difference-in-differences effect.

Another way to write the same difference-in-difference equation if you have only two groups and two time periods is

$$ \begin{array}{r} Y=\beta_{0}+\beta_{1} \text { TreatedGroup }+\beta_{2} \text { AfterTreatment }+ \\ \beta_{3} \text { TreatedGroup } \times \text { AfterTreatment }+\varepsilon \end{array} $$

where TreatedGroup is an indicator that you're in the group being treated (whether it's before or after treatment is actually implemented), and AfterTreatment is an indicator that you're in the "post"-treatment period (whether or not your group is being treated). The third term is an interaction term, in effect an indicator for being in the treated group AND in the post-treatment period, i.e., you're actually being treated right now. This third term is equivalent to Treated in the last equation, and $\hat{\beta}_{3}$ is our difference-in-differences estimate.

This interaction-term version of the equation is attractive because it makes clear what's going on. By standard interaction-term interpretation, $\beta_{3}$ tells us how much bigger the TreatedGroup effect is in the AfterTreatment period than in the before period. That is, how much the treated/untreated gap grows after you implement the treatment. Difference-in-differences!
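
The same conclusion can be checked by writing out the expected response in each of the four group-by-period cells implied by the interaction model, under the assumption that $\varepsilon$ has mean zero in every cell:

$$ \begin{array}{lcl} E[Y \mid \text{untreated group, before}] & = & \beta_{0} \\ E[Y \mid \text{treated group, before}] & = & \beta_{0}+\beta_{1} \\ E[Y \mid \text{untreated group, after}] & = & \beta_{0}+\beta_{2} \\ E[Y \mid \text{treated group, after}] & = & \beta_{0}+\beta_{1}+\beta_{2}+\beta_{3} \end{array} $$

Taking the treated-minus-untreated gap after treatment and subtracting the same gap before treatment leaves only the interaction coefficient:

$$ \left[\left(\beta_{0}+\beta_{1}+\beta_{2}+\beta_{3}\right)-\left(\beta_{0}+\beta_{2}\right)\right]-\left[\left(\beta_{0}+\beta_{1}\right)-\beta_{0}\right]=\left(\beta_{1}+\beta_{3}\right)-\beta_{1}=\beta_{3} $$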

Whichever way you write the equation, this approach is called the “Two-way fixed effects difference-in-difference estimator” since it has two sets of fixed effects, one for group and one for time period.
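
The equivalence of the two ways of writing the model, and of both with the four-means formula, can be illustrated on simulated data. This is a sketch under stated assumptions: the data are artificial, and the variable names, sample size, and coefficient values (including the "true" effect of 3) are arbitrary choices made for illustration only.

```r
set.seed(42)                                     # arbitrary seed, for reproducibility
n <- 400
treated_group   <- rep(c(0, 1), each  = n / 2)   # 1 = unit belongs to the treated group
after_treatment <- rep(c(0, 1), times = n / 2)   # 1 = observation is from the after period
treated <- treated_group * after_treatment       # 1 = actually being treated right now

# Simulated response with a treatment effect of 3 (an assumption of this sketch)
y <- 10 + 2 * treated_group + 1 * after_treatment + 3 * treated + rnorm(n)

# Form 1: group and period fixed effects plus the Treated indicator
fit_fe  <- lm(y ~ factor(treated_group) + factor(after_treatment) + treated)
# Form 2: two indicators and their interaction
fit_int <- lm(y ~ treated_group * after_treatment)

# Four cell means and the difference-in-differences computed by hand
cell <- tapply(y, list(group = treated_group, period = after_treatment), mean)
did_means <- (cell["1", "1"] - cell["0", "1"]) - (cell["1", "0"] - cell["0", "0"])

c(fixed_effects = unname(coef(fit_fe)["treated"]),
  interaction   = unname(coef(fit_int)["treated_group:after_treatment"]),
  four_means    = unname(did_means))
```

All three numbers coincide because, with two groups and two periods, both regressions fit the four cell means exactly.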

Example: Card and Krueger (AER, 1994)


Full paper here

Understanding the geography


The regression

In the canonical DiD set-up (e.g. the Card and Krueger minimum wage study comparing New Jersey and Pennsylvania) there are two units and two time periods, with one of the units being treated in the second period. Graphically, you can think of the relationship as the one presented below:

The following example calculates the DiD estimator for the dataset njmin3, where the response is fte, the full-time equivalent employment; $d$ is the after dummy (renamed after in the code), with $d=1$ for the after period and $d=0$ for the before period; and $nj$ is the dummy that marks the treatment group ($nj_{i}=1$ if unit $i$ is in New Jersey, where the minimum wage law has been changed, and $nj_{i}=0$ if unit $i$ is in Pennsylvania, where the minimum wage law has not changed). In other words, units (fast-food restaurants) located in New Jersey form the treatment group, and units located in Pennsylvania form the control group.

In [1]:
options(warn = -1) # Suppress warnings
qui <- suppressPackageStartupMessages # quiet! - suppress library load messages
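# install (if not already installed) and load packages;
# require() returns FALSE when a package is missing, so load it again after installing
qui(if (!require(stargazer)) { install.packages("stargazer"); library(stargazer) })
qui(if (!require(modelsummary)) { install.packages("modelsummary"); library(modelsummary) })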

#qui(if(!require(remotes)){install.packages('remotes')})
#remotes::install_github("ccolonescu/PoEdata")   # for njmin3 dataset

data("njmin3", package = "PoEdata")
names(njmin3)[names(njmin3) == 'd'] <- 'after'
names(njmin3)[names(njmin3) == 'd_nj'] <- 'nj_after'

head(njmin3)
tail(njmin3)
A data.frame: 6 × 14
       nj after nj_after   fte    bk   kfc  roys wendys co_owned centralj southj   pa1   pa2   demp
    <int> <int>    <int> <dbl> <int> <int> <int>  <int>    <int>    <int>  <int> <int> <int>  <dbl>
  1     1     0        0 15.00     1     0     0      0        0        1      0     0     0  12.00
  2     1     0        0 15.00     1     0     0      0        0        1      0     0     0   6.50
  3     1     0        0 24.00     0     0     1      0        0        1      0     0     0  -1.00
  4     1     0        0 19.25     0     0     1      0        1        0      0     0     0   2.25
  5     1     0        0 21.50     1     0     0      0        0        0      0     0     0  13.00
  6     1     0        0  9.50     0     1     0      0        0        0      0     0     0   1.00
A data.frame: 6 × 14
       nj after nj_after   fte    bk   kfc  roys wendys co_owned centralj southj   pa1   pa2   demp
    <int> <int>    <int> <dbl> <int> <int> <int>  <int>    <int>    <int>  <int> <int> <int>  <dbl>
815     0     1        0 14.25     0     0     1      0        1        0      0     0     1  -9.25
816     0     1        0 12.50     0     0     1      0        1        0      0     0     1  -2.50
817     0     1        0 34.00     0     0     1      0        0        0      0     0     1  16.00
818     0     1        0 10.00     0     0     1      0        1        0      0     0     1 -10.25
819     0     1        0 14.00     1     0     0      0        0        0      0     1     0  -1.50
820     0     1        0 17.50     1     0     0      0        0        0      0     1     0  -3.50
In [2]:
datasummary_skim(njmin3, output = "jupyter")
          Unique (#)  Missing (%)   Mean   SD    Min  Median   Max
nj                 2            0    0.8  0.4    0.0     1.0   1.0
after              2            0    0.5  0.5    0.0     0.5   1.0
nj_after           2            0    0.4  0.5    0.0     0.0   1.0
fte              118            3   21.0  9.4    0.0    20.0  85.0
bk                 2            0    0.4  0.5    0.0     0.0   1.0
kfc                2            0    0.2  0.4    0.0     0.0   1.0
roys               2            0    0.2  0.4    0.0     0.0   1.0
wendys             2            0    0.1  0.4    0.0     0.0   1.0
co_owned           2            0    0.3  0.5    0.0     0.0   1.0
centralj           2            0    0.2  0.4    0.0     0.0   1.0
southj             2            0    0.2  0.4    0.0     0.0   1.0
pa1                2            0    0.1  0.3    0.0     0.0   1.0
pa2                2            0    0.1  0.3    0.0     0.0   1.0
demp             110            6   -0.1  9.0  -41.5     0.0  34.0
In [3]:
mod1a <- lm(fte ~ nj + after + nj * after, data = njmin3)   # main effects written out explicitly
mod1  <- lm(fte ~ nj * after, data = njmin3)    # equivalent: nj * after expands to nj + after + nj:after

stargazer(mod1a, mod1, type = "text")
===========================================================
                                   Dependent variable:     
                               ----------------------------
                                           fte             
                                    (1)            (2)     
-----------------------------------------------------------
nj                                -2.892**      -2.892**   
                                  (1.194)        (1.194)   
                                                           
after                              -2.166        -2.166    
                                  (1.516)        (1.516)   
                                                           
nj:after                           2.754          2.754    
                                  (1.688)        (1.688)   
                                                           
Constant                         23.331***      23.331***  
                                  (1.072)        (1.072)   
                                                           
-----------------------------------------------------------
Observations                        794            794     
R2                                 0.007          0.007    
Adjusted R2                        0.004          0.004    
Residual Std. Error (df = 790)     9.406          9.406    
F Statistic (df = 3; 790)          1.964          1.964    
===========================================================
Note:                           *p<0.1; **p<0.05; ***p<0.01

The coefficient on the term nj:after is $\hat{\delta}$, our difference-in-differences estimate. If we want to test the null hypothesis $H_{0}: \delta \geq 0$ against the alternative $\delta<0$, the rejection region is in the left tail; since the calculated $t$, which is equal to $1.631$, is positive, we cannot reject the null hypothesis. In other words, there is no evidence that an increased minimum wage reduces employment at fast-food restaurants.
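
To connect the regression back to the four-means formula, the cell means can be rebuilt from the coefficients printed in the table. This is a sketch that hard-codes the rounded estimates from the stargazer output rather than reading them from the fitted model, so the results are only accurate to about three decimals; the object names (`b`, `cell_means`, `did_from_means`) are illustrative.

```r
# Rounded coefficients copied from the regression table
b <- c(constant = 23.331, nj = -2.892, after = -2.166, nj_after = 2.754)

# Implied mean fte in each group-by-period cell
cell_means <- c(
  pa_before = b[["constant"]],
  nj_before = b[["constant"]] + b[["nj"]],
  pa_after  = b[["constant"]] + b[["after"]],
  nj_after  = sum(b)
)

# (NJ - PA) after, minus (NJ - PA) before
did_from_means <- (cell_means[["nj_after"]]  - cell_means[["pa_after"]]) -
                  (cell_means[["nj_before"]] - cell_means[["pa_before"]])

round(cell_means, 3)
round(did_from_means, 3)
```

The hand-computed difference-in-differences reproduces the coefficient on nj:after, as expected for a model with only the two indicators and their interaction.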

In [4]:
alpha <- 0.05

CV.onetailed <- qt(alpha , mod1$df.residual)
paste0("The one tail 95% critical value for a t-distribution with ", mod1$df.residual," degrees of freedom is")
round(CV.onetailed,3)

CV.twotailed <- qt(c(alpha/2, 1 - alpha/2) , mod1$df.residual)
paste0("The two tail 95% critical value for a t-distribution with ", mod1$df.residual," degrees of freedom is")
round(CV.twotailed,3)
'The one tail 95% critical value for a t-distribution with 790 degrees of freedom is'
-1.647
'The two tail 95% critical value for a t-distribution with 790 degrees of freedom is'
  1. -1.963
  2. 1.963
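
The left-tail decision rule can be sketched directly from the printed numbers. The estimate, standard error, and degrees of freedom below are copied from the regression output (rounded), so the computed $t$ may differ from the value quoted in the text in the third decimal; the object names are illustrative.

```r
estimate  <- 2.754   # coefficient on nj:after, as printed
std_error <- 1.688   # its standard error, as printed
df_resid  <- 790     # residual degrees of freedom

t_stat  <- estimate / std_error   # test statistic for H0: delta >= 0
cv_left <- qt(0.05, df_resid)     # 5% left-tail critical value
p_left  <- pt(t_stat, df_resid)   # left-tail p-value

reject_h0 <- t_stat < cv_left     # reject only if t falls in the left tail
c(t = t_stat, critical_value = cv_left, p_value = p_left)
reject_h0
```

Because the statistic is positive, it lies far from the left-tail rejection region and the one-sided p-value is close to 1.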

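The estimate becomes slightly larger and more precise when we control for restaurant chain and ownership (model 2) and additionally for location (model 3): the coefficient on nj:after rises from 2.754 to 2.845 and 2.815, and its standard error falls from 1.688 to about 1.5, so the estimate becomes significant at the 10% level: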

In [5]:
mod2 <- lm(fte ~ nj * after +
             kfc + roys + wendys + co_owned, data = njmin3)

mod3 <- lm(fte ~ nj * after +
             kfc + roys + wendys + co_owned +
             southj + centralj + pa1, data = njmin3)

stargazer(mod1, mod2, mod3, type = "text")
========================================================================================
                                            Dependent variable:                         
                    --------------------------------------------------------------------
                                                    fte                                 
                            (1)                   (2)                     (3)           
----------------------------------------------------------------------------------------
nj                       -2.892**              -2.377**                  -0.908         
                          (1.194)               (1.079)                 (1.272)         
                                                                                        
after                     -2.166                -2.224                   -2.212         
                          (1.516)               (1.368)                 (1.349)         
                                                                                        
kfc                                           -10.453***               -10.058***       
                                                (0.849)                 (0.845)         
                                                                                        
roys                                            -1.625*                 -1.693**        
                                                (0.860)                 (0.859)         
                                                                                        
wendys                                          -1.064                   -1.065         
                                                (0.929)                 (0.921)         
                                                                                        
co_owned                                        -1.169                   -0.716         
                                                (0.716)                 (0.719)         
                                                                                        
southj                                                                 -3.702***        
                                                                        (0.780)         
                                                                                        
centralj                                                                 0.008          
                                                                        (0.897)         
                                                                                        
pa1                                                                      0.924          
                                                                        (1.385)         
                                                                                        
nj:after                   2.754                2.845*                   2.815*         
                          (1.688)               (1.523)                 (1.502)         
                                                                                        
Constant                 23.331***             25.951***               25.321***        
                          (1.072)               (1.038)                 (1.211)         
                                                                                        
----------------------------------------------------------------------------------------
Observations                794                   794                     794           
R2                         0.007                 0.196                   0.221          
Adjusted R2                0.004                 0.189                   0.211          
Residual Std. Error  9.406 (df = 790)      8.484 (df = 786)         8.367 (df = 783)    
F Statistic         1.964 (df = 3; 790) 27.448*** (df = 7; 786) 22.265*** (df = 10; 783)
========================================================================================
Note:                                                        *p<0.1; **p<0.05; ***p<0.01
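
To make the comparison across the three specifications concrete, the $t$-statistics and two-sided p-values for nj:after can be recomputed from the printed estimates. This is a sketch using the rounded values in the table; it assumes the significance stars are based on two-sided p-values, which appears to be the stargazer default, and the data frame name `did_tab` is illustrative.

```r
# nj:after estimates, standard errors and residual df copied from the table
did_tab <- data.frame(
  model     = c("(1)", "(2)", "(3)"),
  estimate  = c(2.754, 2.845, 2.815),
  std_error = c(1.688, 1.523, 1.502),
  df_resid  = c(790, 786, 783)
)

did_tab$t_stat      <- did_tab$estimate / did_tab$std_error
did_tab$p_two_sided <- 2 * pt(-abs(did_tab$t_stat), did_tab$df_resid)

print(did_tab, digits = 3)
```

The point estimate changes little across models; most of the gain in significance comes from the smaller standard errors once chain, ownership, and location absorb part of the residual variation.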